Introduction

This project focuses on using data as a tool to assess, analyze, and eventually predict housing prices in Mecklenburg County, North Carolina. This sort of robust descriptive data analytics can help inform policy and help decision makers/stakeholders understand the future demand for land, housing, education, and transportation needs of a neighborhood, city, and the greater region at large. This project was formulated using similar regression, and feature modeling from Zillow housing estimates, to train a model from recent sale prices, and the training set, and assess whether our model generalizes to properties in the same areas that have not sold yet.

This project required several machine learning processes for us to better understand the housing data, the variables within the housing data, and any external data sources we decided to add. Data wrangling, exploratory analysis, feature engineering, feature selection, and model estimation and validation were all machine learning processes used as major steps in this predictive modeling exercise.

In order to better understand/explain sale prices in particular areas, Zillow uses feature modeling which requires manipulating both internal housing characteristics and external characteristics. This hedonic model was used in our research approach, to predict sale prices in Mecklenburg as a theoretical framework. It functions as a way to break down the value of a house price into the value of the characteristics that constitute it. Internal housing characteristics such as the number of bathrooms, the age of the house, and the number of bedrooms, for example, were some of the variables used to evaluate which might impact the sale price of a house in the county. Public services/amenities like tree canopy, or spatial processes of clustering of homes around schools or parks, were some external variables used to attempt to create an accurate and generalizable model.

This project was difficult and required us to utilize many of the technical and analytical skills learned thus far in the semester. However, the most difficult was understanding the value of what homebuyers want. Buying a home is such a nuanced process and there are numerous variables, amenities, and disamenities that inform a potential home buyer’s decision. We learned the power of machine learning but also realized there are limitations to predicting how human needs and desires.

Set Up

Mecklenburg Home Price Data

The dataset imported contains the attributes of homes in Mecklenburg, NC such as prices, bedrooms, bathrooms, etc. We will utilize these attributes to inform our model to predict home prices eventually.

HousingData <- 
  st_read("studentData.geojson")%>%
  st_transform('EPSG:2264')

Mecklenburg <- 
  st_read("MecklenburgCounty_Boundary.shp")%>%
  st_transform('EPSG:2264')

Mecklenburg_Tracts <- 
  st_read("npa.shp")%>%
  st_transform('EPSG:2264')

Internal Variables

Starting with the housing data set as it is, we decided to group, manipulate, and create new variables in the dataset solely based on the attributes of the homes. Our new “internal variables” are included in the model to predict home prices.

# Total baths
HousingData <- 
  HousingData %>%
  mutate(totalbaths = (halfbaths *0.5)+(fullbaths))

# Age
HousingData <- 
  HousingData %>%
  mutate(Age = 2022 - yearbuilt) 
 
# Old vs. New Houses, Old = 0 and new = 1 
HousingData$houseoldnew = ifelse(HousingData$"yearbuilt"<2000, 0 , 1)

# Bedroom Groupings, small houses = 1, medium = 2, large = 3 
HousingData <- mutate(HousingData, bedroomgroups = case_when(bedrooms <= 2 ~ 1,
                                                    bedrooms >= 3 & bedrooms <= 4  ~ 2,
                                                    bedrooms >= 5 ~ 3))

# Story groupings, One level = 1, multilevel = 2
HousingData$storygroups = ifelse(HousingData$"storyheigh"== "1 STORY", 1, 2)

Spatial Structure Data

We included some demographics, and socio-economic and environmental data about Mecklenburg, that are spatially tied to census tracts rather than points. The data was extracted from the 2019 5-year ACS and Mecklenburg’s open data source.

Amenities/Public Services Data

We included amenities and services, we thought would impact the variability of housing prices across the county and neighborhoods. We calculated distances from these amenities as well as buffers to each home observation to determine their spatial impact on sale prices. Schools and parks data were sourced from Mecklenburg’s open data source.

# Public schools
PublicSchools <-
  st_read("CMS_Schools.shp")%>%
  st_transform('EPSG:2264')%>%
  dplyr::select(-school_num,-address,-city,-zipcode,-school,-magnet,-grdlevl,-mag_focus,-schl_stat)

PublicSchools1 <-
  st_read("CMS_Schools.shp")%>%
  st_transform('EPSG:2264')%>%
  dplyr::select(-school_num,-address,-city,-zipcode,-school,-magnet,-grdlevl,
                -mag_focus,-schl_stat,-school_typ,)

# Parks
ParkAmenities <-
    st_read("ParkAmenity.shp")%>%
  st_transform('EPSG:2264')%>%
  dplyr::select(geometry)

Inputting external data into the Mecklenburg housing dataset

In order to use the external variables, we need to add them to the housing dataset by expressing how they relate to each home observation. We will utilize buffers and average nearest neighbor distance to measure occurrences of the amenities and the from the homes to the distances to them.

School buffers, within 3 miles

The buffer sums schools within 3 miles of each home sale observation.

HousingData$schools.Buffer =
    st_buffer(HousingData, 15840) %>%
    aggregate(mutate(PublicSchools1, counter = 1),., sum) %>%
    pull(counter)

Average nearest neighbor distance to schools

This calculates the average nearest neighbor distance from each home sale to its k nearest school.

HousingData <-
  HousingData %>% 
    mutate(
     school_nn1 = nn_function(st_coordinates(HousingData), 
                              st_coordinates(PublicSchools1), k = 1),
      
      school_nn2 = nn_function(st_coordinates(HousingData), 
                              st_coordinates(PublicSchools1), k = 2), 
      
      school_nn3 = nn_function(st_coordinates(HousingData), 
                              st_coordinates(PublicSchools1), k = 3), 
      
      school_nn4 = nn_function(st_coordinates(HousingData), 
                              st_coordinates(PublicSchools1), k = 4), 
      
      school_nn5 = nn_function(st_coordinates(HousingData), 
                              st_coordinates(PublicSchools1), k = 5))
ggplot() + geom_sf(data = Mecklenburg_Tracts, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(PublicSchools1)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#ecda9a", high= "#ee4d5a", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Schools, Mecklenburg County") +
  mapTheme()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

Park buffers, within 1/2 miles

The buffer sums parks within 1/2 mile of each home sale observation.

HousingData$Park.Buffer =
    st_buffer(HousingData, 2640) %>%
    aggregate(mutate(ParkAmenities, counter = 1),., sum) %>%
    pull(counter)

HousingData$Park.Buffer[is.na(HousingData$Park.Buffer)] <- 0

Mapping Exploratory Variables

Mecklenburg Home Prices

This map illustrates the distribution of home prices across the county. From this map, you can notice some patterns of spatial clustering of higher and lower home prices which we hope to capture in our model.

ggplot() +
  geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
  geom_sf(data = Mecklenburg_Tracts, color = "white") +
  geom_sf(data = HousingData, aes(colour = q5(price)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  labs(title= "Home Prices", subtitle = "Mecklenburg County, NC") + 
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

Internal Variables Map

This is a map of one of the internal housing characteristics that were included in our model.

ggplot() +
  geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
  geom_sf(data = Mecklenburg_Tracts, color = "white") +
  geom_sf(data = HousingData, aes(colour = bldggrade), 
          show.legend = "point", size = .25) +
   scale_colour_manual(values = rev(palette13)) +
  labs(title= "Home Quality Description", subtitle = "Mecklenburg County, NC") + 
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

Amenities/Service Maps

The maps below show the distribution of amenities/services overlapped with home prices in the county.

ggplot() +
  geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
  geom_sf(data = Mecklenburg_Tracts, color = "white") +
   geom_sf(data = HousingData, aes(colour = q5(price)), 
          show.legend = "point", size = .25, alpha = .15 )+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = PublicSchools, size = 1.25, color = "#589adb") +
  labs(title= "Public Schools", subtitle = "Mecklenburg County, NC") + 
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

ggplot() +
  geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
  geom_sf(data = Mecklenburg_Tracts, color = "white") +
   geom_sf(data = HousingData, aes(colour = q5(price)),
           show.legend = "point", size = .25, alpha = .15 )+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = ParkAmenities, size = 1.0, color = "#54bebe") +
    labs(title= "Park Amenities", subtitle = "Mecklenburg County") +
    guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

Spatial Structure Maps

The maps below illustrate the relationship between spatial variables and home prices across the county.

ggplot() +
   geom_sf(data = pctBachelor, aes(fill = (Bachelor_2020)))+
  labs(title= "% of Bachelor Degree Holders", subtitle = "Mecklenburg County, NC") + 
   geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
           show.legend = "point", size = .10)+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

ggplot() +
   geom_sf(data = Income, aes(fill = (Income_2020)))+
  labs(title= "Median Income Distribution", subtitle = "Mecklenburg County, NC") + 
   geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
           show.legend = "point", size = .10)+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

Interesting Maps

These final maps include spatial variables whose relationship with home prices appeared to be interesting. For example, looking at the maps illustrating the percentage of Black/White residents overlapped with the home prices, you can immediately see how home prices differ by race in census tracts with a majority of Black/White residents.

ggplot() +
   geom_sf(data = pctBlack, aes(fill = (Black_2020)))+
  labs(title= "% of Black Residents", subtitle = "Mecklenburg County, NC") + 
   geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
           show.legend = "point", size = .10)+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

ggplot() +
   geom_sf(data = pctWhite, aes(fill = (White_2020)))+
  labs(title= "% of White Residents", subtitle = "Mecklenburg County, NC") + 
   geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
           show.legend = "point", size = .10)+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

ggplot() +
   geom_sf(data = pctTreeCanopy, aes(fill = (Treecanopy_2012)))+
  labs(title= "% Tree Cover", subtitle = "Mecklenburg County, NC") + 
   geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
           show.legend = "point", size = .10)+
  scale_colour_manual(values = (palette11),
                   labels=qBr(HousingData,"price"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 5)))+
  mapTheme()

Summary Statistics

The tables below show summary statistics for our internal and external values.

# Internal variables
ss_Internal <- HousingData 

ss_Internal <- st_drop_geometry(ss_Internal)

ss_Internal <- ss_Internal %>%
  dplyr::select("totalbaths", "price", "fullbaths", "halfbaths", "Age", "bedrooms", "numfirepla", "bldggrade", "heatedfuel", "bedroomgroups", "yearbuilt", "actype", "extwall", "foundation", "storyheigh", "heatedarea", "houseoldnew") 

stargazer(as.data.frame(ss_Internal), type="text", digits=1, title = "Descriptive Statistics for Internal Variables of Homes in Meckenburg, NC", out = "Training_Internal.txt")
## 
## Descriptive Statistics for Internal Variables of Homes in Meckenburg, NC
## =======================================================
## Statistic       N      Mean    St. Dev.  Min    Max    
## -------------------------------------------------------
## totalbaths    46,183    2.6       0.9    0.0    11.0   
## price         46,183 465,686.8 602,409.8  0  75,500,000
## fullbaths     46,183    2.3       0.8     0      9     
## halfbaths     46,183    0.6       0.5     0      4     
## Age           46,183   28.6      31.8     0    2,022   
## bedrooms      46,183    3.5       0.9     0      65    
## numfirepla    46,183    0.8       0.5     0      7     
## bedroomgroups 46,183    2.1       0.4     1      3     
## yearbuilt     46,183  1,993.4    31.8     0    2,022   
## heatedarea    46,183  2,359.4   1,060.3  0.0  14,710.0 
## houseoldnew   46,183    0.5       0.5     0      1     
## -------------------------------------------------------
# Amenities/Services
ss_Amenities<- HousingData 

ss_Amenities <- st_drop_geometry(ss_Amenities)

ss_Amenities <- ss_Amenities %>%
  dplyr::select("school_nn1", "Park.Buffer")

stargazer(as.data.frame(ss_Amenities), type="text", digits=1, title="Descriptive Statistics for Amenities and Services near Homes in Meckenburg, NC", out = "Training_External.txt")
## 
## Descriptive Statistics for Amenities and Services near Homes in Meckenburg, NC
## ==================================================
## Statistic     N     Mean   St. Dev.  Min    Max   
## --------------------------------------------------
## school_nn1  46,183 4,759.2 3,070.7  117.0 22,481.3
## Park.Buffer 46,183   5.5     10.8     0     132   
## --------------------------------------------------
# Spatial Structure
ss_Spatial<- HousingData 

ss_Spatial <- st_drop_geometry(ss_Spatial)

ss_Spatial <- ss_Spatial %>%
  dplyr::select("Black_2020", "White_2020", "Income_2020", "Treecanopy_2012", "Bachelor_2020")

stargazer(as.data.frame(ss_Spatial), type="text", digits=1, title="Descriptive Statistics for Spatial Structures near Homes in Meckenburg, NC", out = "Training_External2.txt")
## 
## Descriptive Statistics for Spatial Structures near Homes in Meckenburg, NC
## =======================================================
## Statistic         N      Mean   St. Dev.  Min     Max  
## -------------------------------------------------------
## Black_2020      46,181   27.0     21.2    0.4    87.2  
## White_2020      46,181   49.5     26.2    1.5    93.3  
## Income_2020     45,201 89,895.1 42,113.8 10,797 250,001
## Treecanopy_2012 46,165   54.0     12.4    1.6    77.0  
## Bachelor_2020   46,181   47.2     20.8    4.2    91.6  
## -------------------------------------------------------

Preparing for Regressions, Creating Challenge Set

In order to train and test our model, we remove the rows where the home prices are zeros to prepare our model for prediction. The challenge set will be made of 100 variables with no home prices which we will predict after our model is trained and tested.

# Challenge set
Challenge_Meclenburg<- filter(HousingData, price == 0) 

# Remove the zeros
HousingData_nozero <-HousingData[!(HousingData$price== 0),]

Preliminary Analysis, Correlations

The following graphs will begin to analyze the relationship between our variables and home sale prices. These graphs will be helpful to determine significant variables in our data set that should be added to our model.

# Home Features cor
st_drop_geometry(HousingData_nozero) %>%
  dplyr::select(price, Age, bedrooms, numfirepla) %>%
  filter(price <= 8000000, Age<200, bedrooms<10) %>%
  gather(Variable, Value, -price) %>% 
ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

#bldggrade, heatedfuel,bedroomgroups, storygroups
st_drop_geometry(HousingData_nozero) %>%
  dplyr::select(price, bedroomgroups, storygroups, totalbaths) %>%
  filter(price <= 8000000) %>%
  gather(Variable, Value, -price) %>% 
ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables") + 
     plotTheme()

# School Corr
HousingData_nozero %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("school_")) %>%
  filter(price <= 8000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

Correlation Matrix

numericVars <- 
  select_if(st_drop_geometry(HousingData_nozero), is.numeric) %>% na.omit()
  ggcorrplot(
  round(cor(numericVars), 1),
  tl.cex=5,
  p.mat = cor_pmat(numericVars),
  colors = c("#f3e79b","#eb7f86","#5c53a5","#a059a0", "#fac484"), 
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

#palette11<- c("#f3e79b","#fac484","#f8a07e","#eb7f86","#ce6693","#a059a0","#5c53a5")

Initial Regression Model

Our Initial regression model includes the adjusted internal and external variables that have been added, as well as the attributes of the home that were present in the housing data. The adjusted R-squared values show that this model accounts for 24% of the variation in home sale prices.

## 
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(HousingData_nozero) %>% 
##     dplyr::select(price, fullbaths, halfbaths, Age, bedrooms, 
##         numfirepla, bldggrade, heatedfuel, bedroomgroups, yearbuilt, 
##         actype, extwall, foundation, storyheigh, heatedarea, 
##         houseoldnew, Income_2020, Treecanopy_2012, White_2020, 
##         Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2379200   -83657   -15067    56436 74496603 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -3.742e+05  3.032e+05  -1.234 0.217241    
## fullbaths                     3.011e+04  5.841e+03   5.155 2.54e-07 ***
## halfbaths                     2.735e+04  6.384e+03   4.285 1.83e-05 ***
## Age                           1.606e+03  2.269e+02   7.076 1.50e-12 ***
## bedrooms                      3.988e+04  4.227e+03   9.435  < 2e-16 ***
## numfirepla                   -5.384e+03  6.188e+03  -0.870 0.384254    
## bldggradeCUSTOM               1.493e+06  3.957e+04  37.741  < 2e-16 ***
## bldggradeEXCELLENT            7.139e+05  2.213e+04  32.262  < 2e-16 ***
## bldggradeFAIR                -2.748e+04  2.123e+04  -1.294 0.195554    
## bldggradeGOOD                 5.553e+04  7.579e+03   7.327 2.39e-13 ***
## bldggradeMINIMUM             -5.280e+04  1.147e+05  -0.460 0.645356    
## bldggradeVERY GOOD            2.581e+05  1.238e+04  20.844  < 2e-16 ***
## heatedfuelGAS                -1.841e+04  7.977e+03  -2.308 0.021011 *  
## heatedfuelNONE               -2.393e+04  6.721e+04  -0.356 0.721854    
## heatedfuelOIL/WD/COAL         6.569e+04  3.584e+04   1.833 0.066788 .  
## heatedfuelSOLAR/GEOTHRM      -1.237e+04  1.889e+05  -0.066 0.947760    
## bedroomgroups                -5.091e+04  9.582e+03  -5.313 1.08e-07 ***
## yearbuilt                            NA         NA      NA       NA    
## actypeAC-CHLD WAT            -1.025e+05  3.013e+05  -0.340 0.733608    
## actypeAC-NONE                -4.309e+04  1.536e+04  -2.805 0.005037 ** 
## actypeAC-PCKD ROOF            1.054e+04  5.216e+05   0.020 0.983881    
## actypeAC-WALL UNIT           -4.803e+04  9.150e+04  -0.525 0.599641    
## actypeAIR-DUCTED             -3.024e+03  7.965e+04  -0.038 0.969717    
## actypeAIR-NO-DUCT            -3.738e+04  1.845e+05  -0.203 0.839443    
## actypeHEAT - NONE             1.117e+05  3.750e+05   0.298 0.765878    
## actypeHEAT PUMP               5.458e+04  6.810e+04   0.801 0.422857    
## extwallASB SHNG/SDG          -8.251e+04  4.292e+04  -1.922 0.054566 .  
## extwallBOARD&BATTEN           9.352e+04  8.725e+04   1.072 0.283813    
## extwallCEDAR,RDWD             8.334e+04  2.675e+04   3.116 0.001837 ** 
## extwallCEM BR/SPL B          -2.630e+04  9.112e+04  -0.289 0.772906    
## extwallCOMP OR WLBD          -3.447e+04  2.337e+05  -0.148 0.882737    
## extwallCONC BLOCK            -3.424e+03  9.175e+04  -0.037 0.970230    
## extwallCORR MTL LGT          -1.837e+05  3.890e+05  -0.472 0.636768    
## extwallEXT PLYWOOD            1.109e+05  4.366e+04   2.539 0.011123 *  
## extwallFACE BRICK            -1.567e+04  8.586e+03  -1.825 0.067976 .  
## extwallHARDIPLK/DSGN VINYL    5.783e+03  8.311e+03   0.696 0.486574    
## extwallIMITATION STONE        1.413e+05  3.690e+05   0.383 0.701706    
## extwallLOG                    4.057e+04  1.449e+05   0.280 0.779418    
## extwallMASONITE               2.747e+03  1.010e+04   0.272 0.785545    
## extwallMODULAR MTL           -8.691e+04  3.095e+05  -0.281 0.778845    
## extwallPREFIN MTL            -5.713e+04  3.015e+05  -0.190 0.849691    
## extwallSDG MIN/NONE          -1.995e+03  3.689e+05  -0.005 0.995685    
## extwallSIDG NO SHTG          -8.505e+04  1.275e+05  -0.667 0.504625    
## extwallSTONE                  2.452e+05  1.091e+05   2.246 0.024687 *  
## extwallSTUCCO HRDCT           4.713e+04  2.354e+04   2.003 0.045217 *  
## extwallSTUCCO SYNTH           2.844e+04  5.328e+04   0.534 0.593525    
## extwallWOOD ON SHTG          -5.140e+03  1.381e+04  -0.372 0.709689    
## extwallWOOD SHINGLE           9.147e+04  4.246e+04   2.154 0.031232 *  
## foundationCRAWL SPACE         2.239e+04  1.913e+04   1.170 0.241883    
## foundationPIER-NO FOUND WALL  3.057e+05  2.468e+05   1.239 0.215461    
## foundationSLAB-ABV GRD        7.813e+04  7.859e+04   0.994 0.320152    
## foundationSLAB-COM            4.041e+04  2.341e+05   0.173 0.862937    
## foundationSLAB-RES            3.051e+04  1.977e+04   1.543 0.122753    
## storyheigh1 STORY             1.997e+05  3.016e+05   0.662 0.507747    
## storyheigh1.5 STORY           1.962e+05  3.016e+05   0.650 0.515450    
## storyheigh2.0 STORY           1.386e+05  3.015e+05   0.460 0.645737    
## storyheigh2.5 STORY           1.116e+05  3.018e+05   0.370 0.711698    
## storyheigh3.0 STORY           4.713e+03  3.081e+05   0.015 0.987795    
## storyheighBI-LEVEL            1.094e+05  3.042e+05   0.360 0.719126    
## storyheighCAPE COD            1.602e+05  3.233e+05   0.495 0.620264    
## storyheighRANCH W/BSMT        1.501e+05  3.026e+05   0.496 0.619759    
## storyheighSPLIT LEVEL         1.338e+05  3.021e+05   0.443 0.657868    
## heatedarea                    1.101e+02  5.322e+00  20.688  < 2e-16 ***
## houseoldnew                   3.721e+04  9.101e+03   4.089 4.34e-05 ***
## Income_2020                   3.773e-01  1.084e-01   3.479 0.000503 ***
## Treecanopy_2012              -1.419e+03  2.401e+02  -5.907 3.50e-09 ***
## White_2020                    2.129e+03  3.054e+02   6.972 3.17e-12 ***
## Black_2020                    3.880e+02  3.082e+02   1.259 0.208100    
## Bachelor_2020                 2.810e+02  2.440e+02   1.152 0.249463    
## Park.Buffer                   2.143e+02  2.500e+02   0.857 0.391330    
## schools.Buffer                8.114e+03  4.969e+02  16.331  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 521500 on 44501 degrees of freedom
##   (1512 observations deleted due to missingness)
## Multiple R-squared:  0.2432, Adjusted R-squared:  0.2421 
## F-statistic: 207.3 on 69 and 44501 DF,  p-value: < 2.2e-16
## 
## First Linear Model of Dataset
## ========================================================
##                                  Dependent variable:    
##                              ---------------------------
##                                         price           
## --------------------------------------------------------
## fullbaths                            30,114.0***        
##                                       (5,841.3)         
##                                                         
## halfbaths                            27,350.8***        
##                                       (6,383.6)         
##                                                         
## Age                                  1,605.7***         
##                                        (226.9)          
##                                                         
## bedrooms                             39,882.4***        
##                                       (4,227.3)         
##                                                         
## numfirepla                            -5,383.9          
##                                       (6,187.8)         
##                                                         
## bldggradeCUSTOM                    1,493,290.0***       
##                                      (39,566.8)         
##                                                         
## bldggradeEXCELLENT                  713,937.0***        
##                                      (22,129.5)         
##                                                         
## bldggradeFAIR                         -27,481.3         
##                                      (21,231.9)         
##                                                         
## bldggradeGOOD                        55,532.8***        
##                                       (7,579.0)         
##                                                         
## bldggradeMINIMUM                      -52,796.2         
##                                      (114,717.7)        
##                                                         
## bldggradeVERY GOOD                  258,062.6***        
##                                      (12,380.9)         
##                                                         
## heatedfuelGAS                        -18,409.7**        
##                                       (7,976.9)         
##                                                         
## heatedfuelNONE                        -23,927.3         
##                                      (67,214.6)         
##                                                         
## heatedfuelOIL/WD/COAL                 65,694.1*         
##                                      (35,836.9)         
##                                                         
## heatedfuelSOLAR/GEOTHRM               -12,374.6         
##                                      (188,865.5)        
##                                                         
## bedroomgroups                       -50,910.0***        
##                                       (9,582.1)         
##                                                         
## yearbuilt                                               
##                                                         
##                                                         
## actypeAC-CHLD WAT                    -102,532.3         
##                                      (301,271.0)        
##                                                         
## actypeAC-NONE                       -43,087.9***        
##                                      (15,362.4)         
##                                                         
## actypeAC-PCKD ROOF                    10,537.4          
##                                      (521,566.2)        
##                                                         
## actypeAC-WALL UNIT                    -48,029.6         
##                                      (91,498.8)         
##                                                         
## actypeAIR-DUCTED                      -3,024.0          
##                                      (79,653.2)         
##                                                         
## actypeAIR-NO-DUCT                     -37,375.6         
##                                      (184,472.8)        
##                                                         
## actypeHEAT - NONE                     111,654.2         
##                                      (374,964.3)        
##                                                         
## actypeHEAT PUMP                       54,579.7          
##                                      (68,098.5)         
##                                                         
## extwallASB SHNG/SDG                  -82,509.8*         
##                                      (42,921.0)         
##                                                         
## extwallBOARD&BATTEN                   93,517.0          
##                                      (87,251.8)         
##                                                         
## extwallCEDAR,RDWD                    83,344.6***        
##                                      (26,750.4)         
##                                                         
## extwallCEM BR/SPL B                   -26,295.3         
##                                      (91,120.6)         
##                                                         
## extwallCOMP OR WLBD                   -34,473.8         
##                                      (233,718.5)        
##                                                         
## extwallCONC BLOCK                     -3,424.0          
##                                      (91,745.9)         
##                                                         
## extwallCORR MTL LGT                  -183,711.3         
##                                      (389,033.1)        
##                                                         
## extwallEXT PLYWOOD                   110,853.6**        
##                                      (43,662.0)         
##                                                         
## extwallFACE BRICK                    -15,671.7*         
##                                       (8,586.2)         
##                                                         
## extwallHARDIPLK/DSGN VINYL             5,782.8          
##                                       (8,311.4)         
##                                                         
## extwallIMITATION STONE                141,335.7         
##                                      (369,002.9)        
##                                                         
## extwallLOG                            40,572.1          
##                                      (144,859.1)        
##                                                         
## extwallMASONITE                        2,746.9          
##                                      (10,095.3)         
##                                                         
## extwallMODULAR MTL                    -86,905.7         
##                                      (309,464.6)        
##                                                         
## extwallPREFIN MTL                     -57,130.5         
##                                      (301,458.1)        
##                                                         
## extwallSDG MIN/NONE                   -1,995.1          
##                                      (368,919.0)        
##                                                         
## extwallSIDG NO SHTG                   -85,050.7         
##                                      (127,467.3)        
##                                                         
## extwallSTONE                         245,184.7**        
##                                      (109,148.4)        
##                                                         
## extwallSTUCCO HRDCT                  47,134.3**         
##                                      (23,535.5)         
##                                                         
## extwallSTUCCO SYNTH                   28,436.2          
##                                      (53,277.3)         
##                                                         
## extwallWOOD ON SHTG                   -5,140.0          
##                                      (13,806.8)         
##                                                         
## extwallWOOD SHINGLE                  91,471.0**         
##                                      (42,462.3)         
##                                                         
## foundationCRAWL SPACE                 22,385.4          
##                                      (19,127.8)         
##                                                         
## foundationPIER-NO FOUND WALL          305,659.0         
##                                      (246,757.0)        
##                                                         
## foundationSLAB-ABV GRD                78,131.3          
##                                      (78,590.5)         
##                                                         
## foundationSLAB-COM                    40,406.2          
##                                      (234,051.0)        
##                                                         
## foundationSLAB-RES                    30,511.9          
##                                      (19,769.9)         
##                                                         
## storyheigh1 STORY                     199,742.5         
##                                      (301,564.3)        
##                                                         
## storyheigh1.5 STORY                   196,172.1         
##                                      (301,626.8)        
##                                                         
## storyheigh2.0 STORY                   138,600.3         
##                                      (301,504.3)        
##                                                         
## storyheigh2.5 STORY                   111,551.2         
##                                      (301,833.0)        
##                                                         
## storyheigh3.0 STORY                    4,712.6          
##                                      (308,058.6)        
##                                                         
## storyheighBI-LEVEL                    109,410.8         
##                                      (304,231.6)        
##                                                         
## storyheighCAPE COD                    160,207.8         
##                                      (323,338.4)        
##                                                         
## storyheighRANCH W/BSMT                150,135.4         
##                                      (302,573.2)        
##                                                         
## storyheighSPLIT LEVEL                 133,766.5         
##                                      (302,050.8)        
##                                                         
## heatedarea                            110.1***          
##                                         (5.3)           
##                                                         
## houseoldnew                          37,214.9***        
##                                       (9,101.2)         
##                                                         
## Income_2020                            0.4***           
##                                         (0.1)           
##                                                         
## Treecanopy_2012                      -1,418.6***        
##                                        (240.1)          
##                                                         
## White_2020                           2,129.0***         
##                                        (305.4)          
##                                                         
## Black_2020                              388.0           
##                                        (308.2)          
##                                                         
## Bachelor_2020                           281.0           
##                                        (244.0)          
##                                                         
## Park.Buffer                             214.3           
##                                        (250.0)          
##                                                         
## schools.Buffer                       8,114.2***         
##                                        (496.9)          
##                                                         
## Constant                             -374,172.5         
##                                      (303,240.0)        
##                                                         
## --------------------------------------------------------
## Observations                           44,571           
## R2                                       0.2            
## Adjusted R2                              0.2            
## Residual Std. Error            521,476.8 (df = 44501)   
## F Statistic                   207.3*** (df = 69; 44501) 
## ========================================================
## Note:                        *p<0.1; **p<0.05; ***p<0.01

Splitting Data into Training and Test Sets

We split 70% of our data into the training set and 30% into the test set.

set.seed(3456)
trainIndex <- createDataPartition(HousingData_nozero$price, p = .7,
                                  list = FALSE,
                                  times = 1)
Train <- HousingData_nozero[ trainIndex,]
Test <- HousingData_nozero[-trainIndex,]

Regresion Model

The adjusted R-Squared values indicate that our model accounts for 21% of th variance in home sale prices.

## 
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(Train) %>% dplyr::select(price, 
##     fullbaths, halfbaths, Age, bedrooms, numfirepla, bldggrade, 
##     heatedfuel, bedroomgroups, yearbuilt, actype, extwall, foundation, 
##     storyheigh, heatedarea, houseoldnew, Income_2020, Treecanopy_2012, 
##     White_2020, Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3549771   -86480   -15434    57713 74234971 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -3.930e+05  3.253e+05  -1.208  0.22692    
## fullbaths                     3.005e+04  7.470e+03   4.023 5.76e-05 ***
## halfbaths                     3.210e+04  8.181e+03   3.924 8.73e-05 ***
## Age                           1.605e+03  2.930e+02   5.479 4.31e-08 ***
## bedrooms                      5.922e+04  5.314e+03  11.144  < 2e-16 ***
## numfirepla                   -6.385e+03  7.922e+03  -0.806  0.42024    
## bldggradeCUSTOM               1.540e+06  4.981e+04  30.918  < 2e-16 ***
## bldggradeEXCELLENT            7.050e+05  2.839e+04  24.831  < 2e-16 ***
## bldggradeFAIR                -2.024e+04  2.740e+04  -0.739  0.46008    
## bldggradeGOOD                 5.847e+04  9.668e+03   6.048 1.48e-09 ***
## bldggradeMINIMUM             -4.111e+04  1.450e+05  -0.283  0.77683    
## bldggradeVERY GOOD            2.735e+05  1.577e+04  17.340  < 2e-16 ***
## heatedfuelGAS                -2.520e+04  1.022e+04  -2.465  0.01371 *  
## heatedfuelNONE               -2.244e+04  8.908e+04  -0.252  0.80113    
## heatedfuelOIL/WD/COAL         1.065e+05  4.613e+04   2.308  0.02102 *  
## heatedfuelSOLAR/GEOTHRM       8.156e+02  2.164e+05   0.004  0.99699    
## bedroomgroups                -6.625e+04  1.224e+04  -5.412 6.27e-08 ***
## yearbuilt                            NA         NA      NA       NA    
## actypeAC-CHLD WAT            -1.020e+05  3.222e+05  -0.317  0.75158    
## actypeAC-NONE                -4.035e+04  1.958e+04  -2.061  0.03935 *  
## actypeAC-WALL UNIT           -4.223e+04  1.105e+05  -0.382  0.70236    
## actypeAIR-DUCTED              4.267e+03  1.020e+05   0.042  0.96662    
## actypeAIR-NO-DUCT            -5.988e+04  2.496e+05  -0.240  0.81039    
## actypeHEAT PUMP              -2.078e+04  9.456e+04  -0.220  0.82604    
## extwallASB SHNG/SDG          -1.048e+05  5.546e+04  -1.889  0.05885 .  
## extwallBOARD&BATTEN           8.177e+04  1.167e+05   0.701  0.48362    
## extwallCEDAR,RDWD             8.767e+04  3.531e+04   2.483  0.01304 *  
## extwallCEM BR/SPL B          -1.533e+04  1.283e+05  -0.119  0.90492    
## extwallCOMP OR WLBD           3.825e+03  2.790e+05   0.014  0.98906    
## extwallCONC BLOCK            -7.168e+04  1.265e+05  -0.567  0.57092    
## extwallCORR MTL LGT          -1.850e+05  4.161e+05  -0.445  0.65668    
## extwallEXT PLYWOOD            1.675e+05  5.500e+04   3.045  0.00233 ** 
## extwallFACE BRICK            -1.980e+04  1.102e+04  -1.797  0.07230 .  
## extwallHARDIPLK/DSGN VINYL    7.425e+03  1.061e+04   0.700  0.48385    
## extwallIMITATION STONE        2.238e+05  5.579e+05   0.401  0.68826    
## extwallLOG                    2.113e+04  1.861e+05   0.114  0.90962    
## extwallMASONITE              -9.719e+03  1.292e+04  -0.752  0.45190    
## extwallMODULAR MTL           -1.030e+05  3.325e+05  -0.310  0.75670    
## extwallPREFIN MTL            -8.059e+04  3.225e+05  -0.250  0.80268    
## extwallSDG MIN/NONE          -2.092e+04  5.582e+05  -0.037  0.97010    
## extwallSIDG NO SHTG          -9.149e+04  1.621e+05  -0.564  0.57251    
## extwallSTONE                  3.197e+05  1.552e+05   2.060  0.03940 *  
## extwallSTUCCO HRDCT           6.524e+04  3.019e+04   2.161  0.03071 *  
## extwallSTUCCO SYNTH           3.258e+04  6.777e+04   0.481  0.63072    
## extwallWOOD ON SHTG          -8.937e+03  1.756e+04  -0.509  0.61075    
## extwallWOOD SHINGLE           8.514e+04  5.306e+04   1.605  0.10859    
## foundationCRAWL SPACE         3.276e+04  2.447e+04   1.339  0.18064    
## foundationPIER-NO FOUND WALL  3.088e+05  2.643e+05   1.168  0.24267    
## foundationSLAB-ABV GRD        1.105e+05  1.017e+05   1.087  0.27700    
## foundationSLAB-COM            9.801e+04  3.952e+05   0.248  0.80412    
## foundationSLAB-RES            3.678e+04  2.529e+04   1.454  0.14583    
## storyheigh1 STORY             2.030e+05  3.227e+05   0.629  0.52935    
## storyheigh1.5 STORY           1.774e+05  3.228e+05   0.550  0.58262    
## storyheigh2.0 STORY           1.321e+05  3.226e+05   0.410  0.68216    
## storyheigh2.5 STORY           1.032e+05  3.231e+05   0.319  0.74947    
## storyheigh3.0 STORY          -1.794e+04  3.321e+05  -0.054  0.95693    
## storyheighBI-LEVEL            1.225e+05  3.271e+05   0.375  0.70802    
## storyheighCAPE COD            1.416e+05  3.515e+05   0.403  0.68709    
## storyheighRANCH W/BSMT        1.567e+05  3.243e+05   0.483  0.62884    
## storyheighSPLIT LEVEL         1.254e+05  3.234e+05   0.388  0.69816    
## heatedarea                    1.001e+02  6.861e+00  14.595  < 2e-16 ***
## houseoldnew                   4.540e+04  1.166e+04   3.893 9.91e-05 ***
## Income_2020                   2.835e-01  1.381e-01   2.053  0.04011 *  
## Treecanopy_2012              -1.485e+03  3.075e+02  -4.831 1.37e-06 ***
## White_2020                    2.138e+03  3.915e+02   5.460 4.79e-08 ***
## Black_2020                    4.270e+02  3.957e+02   1.079  0.28049    
## Bachelor_2020                 5.393e+02  3.116e+02   1.731  0.08353 .  
## Park.Buffer                   2.029e+02  3.205e+02   0.633  0.52660    
## schools.Buffer                8.153e+03  6.365e+02  12.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 557600 on 31095 degrees of freedom
##   (1096 observations deleted due to missingness)
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.2198 
## F-statistic:   132 on 67 and 31095 DF,  p-value: < 2.2e-16

Accuracy - Mean Absolute Error

Our mean MAE - price absolute error was 139,410 and our MAPE was 34%.

## Error in `stopifnot()`:
## ! Problem while computing `Price.Predict = predict(reg_training, Test)`.
## Caused by error in `model.frame.default()`:
## ! factor actype has new levels AC-PCKD ROOF, HEAT - NONE

Generalizability - Cross-validation

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(Train) %>% 
                                dplyr::select(price, fullbaths, halfbaths, Age, bedrooms, numfirepla, bedroomgroups, yearbuilt,  heatedarea, houseoldnew, Income_2020, Treecanopy_2012, White_2020, Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 32259 samples
##    16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 31936, 31935, 31937, 31937, 31939, 31935, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   316154.6  0.564399  136358.2
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv$resample[1:5,]
##       RMSE  Rsquared      MAE Resample
## 1 303513.5 0.5898570 148043.9  Fold001
## 2 234318.9 0.6623346 144317.7  Fold002
## 3 178858.0 0.6597736 120258.3  Fold003
## 4 170824.6 0.6117729 117214.9  Fold004
## 5 250626.6 0.6431760 129566.3  Fold005
mean(reg.cv$resample[,3])
## [1] 136358.2
## 
## Cross Validation Results
## =======================================================
## Statistic  N    Mean    St. Dev.     Min        Max    
## -------------------------------------------------------
## RMSE      100 316,154.6 481,573.2 147,676.6 4,276,068.0
## Rsquared  100    0.6       0.2      0.01        0.7    
## MAE       100 136,358.2 29,890.2  110,452.0  358,648.6 
## -------------------------------------------------------
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Spatial Lag

coords <- st_coordinates(Train) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Train$lagPrice <- lag.listw(spatialWeights, Train$price)

coords.test <-  st_coordinates(Test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")

Moran’s I

Our Moran’s I is 0.27, with a p-value of 0.001. The Moran’s I output suggests that there is no autocorrelation. However, a p-value of 0.001 suggests that the observed point process is more clustered than all 999 random permutations (1 / 999 = 0.001) and is statistically significant.

Test_lag <- 
  Test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error, NAOK = TRUE))
## Error in `stopifnot()`:
## ! Problem while computing `lagPriceError =
##   lag.listw(spatialWeights.test, Price.Error, NAOK = TRUE)`.
## Caused by error in `lag.listw()`:
## ! object 'Price.Error' not found
Test_filter <- Test_lag %>% filter(Price.Error >0,lagPriceError >0)
## Error in filter(., Price.Error > 0, lagPriceError > 0): object 'Test_lag' not found
## Error in ggplot(data = Test_filter, aes(lagPriceError, Price.Error)): object 'Test_filter' not found

MoranTest <-moran.mc(na.omit(Test_lag$Price.Error),
                     spatialWeights.test, nsim = 999, zero.policy = TRUE)
## Error in na.omit(Test_lag$Price.Error): object 'Test_lag' not found
ggplot(as.data.frame(MoranTest$res[c(1:999)]), aes(MoranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = MoranTest$statistic), colour = "pink",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in pink",
       x="Moran's I",
       y="Count") +
  plotTheme()
## Error in as.data.frame(MoranTest$res[c(1:999)]): object 'MoranTest' not found

Neighborhood (Census tracts)

We included a spatial variable to account for the spatial process at the neighborhood or local scale. We used census tracts to organize the grouping of homes into “neighborhoods”, to understand the neighborhood effects on home prices.

Test_lag %>%
  st_drop_geometry() %>%
    group_by(NAME)%>% summarize(meanPrice = mean(price, na.rm = T), 
            meanPrediction = mean(Price.Predict)) %>%
      kable(caption = "Price and Prediction by block group") %>% kable_material_dark("basic", full_width = F) %>% scroll_box(width = "800px", height = "500px")
## Error in st_drop_geometry(.): object 'Test_lag' not found
ggplot() +
  geom_sf(data = Mecklenburg, fill = "grey89", color = "grey89") +
  geom_sf(data = Test_lag, aes(colour = q5(Price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette11,
                   labels=qBr(Test_lag,"Price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Predicted Sale Price,  Mecklenburg County, NC") +
  mapTheme()
## Error in fortify(data): object 'Test_lag' not found

Test_lag <-
  Test_lag %>%
  group_by(NAME) %>%
  mutate(MeanPrice = mean(price))
## Error in group_by(., NAME): object 'Test_lag' not found
Test_lag <-
  Test_lag %>%
  group_by(NAME) %>%
  mutate(MAPE = mean(Price.APE)) 
## Error in group_by(., NAME): object 'Test_lag' not found
ggplot(Test_lag) +
  geom_point(aes(MeanPrice, MAPE)) +
  geom_smooth(method = "lm", aes(MeanPrice, MAPE), se = FALSE, colour = "pink") +
  labs(title = "MAPE by Census Tracts as a Function of Mean Price by Tract",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme()
## Error in ggplot(Test_lag): object 'Test_lag' not found

# Mean 
st_drop_geometry(Test_lag) %>%
  group_by(NAME) %>%
  summarize(MAPE = mean(Price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(Mck_tracts) %>%
    st_sf() %>%
    ggplot() + 
   geom_sf(aes(fill = MAPE)) +
      scale_colour_manual(values = palette11,
                          name = "MAPE") +
  geom_sf(data = Test_lag, colour = "grey", show.legend = "point", size = .5) +
      labs(title = "Mean test set MAPE by Block Groups") +
      mapTheme()
## Error in st_drop_geometry(Test_lag): object 'Test_lag' not found

Tidycensus Context

We studied the generalizability across urban contexts using race and income as metrics. One can see that Mecklenburg is segregated by income and race.

ggplot() + geom_sf(data = na.omit(Mck_tracts), aes(fill = raceContext)) +
    scale_fill_manual(values = palette11, name="Race Context") +
    labs(title = "Race in Mecklenburg County,NC") +
    mapTheme() + theme(legend.position="bottom")

ggplot() + geom_sf(data = na.omit(Mck_tracts), aes(fill = incomeContext)) +
    scale_fill_manual(values = c(palette11), name="Income Context") +
    labs(title = "Median Household Income in Mecklenburg County, NC") +
    mapTheme() + theme(legend.position="bottom")

#palette11<- c("#f3e79b","#fac484","#f8a07e","#eb7f86","#ce6693","#a059a0","#5c53a5")

Predictions

Challenge_Meclenburg <-
  Challenge_Meclenburg %>%
  mutate(Price.Predict = predict(reg_training, Challenge_Meclenburg),
         Price.Error = Price.Predict - price,
         Price.AbsError = abs(Price.Predict - price),
         Price.APE = (abs(Price.Predict - price)) / Price.Predict)

#PredictionsExport <- subset(Predictions, select = c("MUSA_ID", "Price.Predict")) %>% st_drop_geometry()

Discussion

In general, our model produced relatively well. 21% of our model can explain the variance of sale prices in Mecklenburg County, with an average sale price error of roughly 130K. We feel that our model is decent at indicating the prediction of sale prices, but maybe not as generalizable in predicting houses in a different regional context. Additionally, It doesn’t capture the nuances within the local neighborhood context.

Variables that provided interesting insight and proceed to be rather significant were the physical characteristics of the house. More obviously, variables like full bathrooms, the number of bedrooms, the age of the house, and the quality of the building grade proved to be significant indicators. Some of the spatial amenities that we added also proved to be a significant indicator in predicting sale prices, including the percentage of white and black residents in an area, and median household income. However, the white population was far more significant than the black population in predicting the sale price of a home. To our surprise, our added variable of residential tree canopy and school buffers were highly significant indicators.

Conclusion

We would not recommend our model to Zillow as it currently stands. Our model would benefit from further transformation and categorization of the home characteristics within the housing data set. Attributes related to the square footage of the home and lot could have been interesting to add to the model. Additionally, in a car-dependent county such as Mecklenburg, we felt that information on garages and parking would have an impact on home prices and individual decisions and preferences. Other external variables such as crime, grocery stores, and flood data would have been good variables to add to the model. Furthermore, our model’s generalizability across space was not as strong as it could be. We are also missing some spatial nuances that impact home prices that could have been achieved through more thorough spatial processes.

The biggest challenge we faced in this project was in debugging our code and dealing with unforeseen errors with the R studio interface. We ran into several situations where our code would initially run, and suddenly stop after. This disallowed us to explore certain processes further due to time constraints. Overall, through this exercise, we were introduced to new tools such as geospatial machine learning and challenged the boundaries of our understanding as well as our capabilities in R.